This dataset stores temperature readings from IoT devices installed outside and inside an anonymous room. Because the device was in a testing phase, it was uninstalled or shut off several times during the reading period, which caused some outliers and missing values.
Dataset details:
- id : unique IDs for each reading
- room_id/id : id of the room in which the device was installed (currently 'admin room' only, for example purposes)
- noted_date : date and time of reading
- temp : temperature readings
- out/in : whether the reading was taken from a device installed inside or outside the room

We can enjoy finding out the following:
- the relationship between inside and outside temperature
- trend or seasonality in the data
- forecasting future temperature by using time-series modeling
- characteristic tendencies across year, month, week, or day/night
- and so on...

We can also:
- Practice data cleansing techniques
- Practice EDA technique to deal with time-series data
- Series Decomposition into trend/seasonality
- Practice visualising technique
- Practice time-series modeling technique
- Prophet
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import os
from fbprophet import Prophet
from fbprophet.plot import add_changepoints_to_plot
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/temperature-readings-iot-devices/IOT-temp.csv
df = pd.read_csv("/kaggle/input/temperature-readings-iot-devices/IOT-temp.csv")
print(f'IOT-temp.csv : {df.shape}')
df.head(3)
IOT-temp.csv : (97606, 5)
| id | room_id/id | noted_date | temp | out/in | |
|---|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | Room Admin | 08-12-2018 09:30 | 29 | In |
| 1 | __export__.temp_log_196131_7bca51bc | Room Admin | 08-12-2018 09:30 | 29 | In |
| 2 | __export__.temp_log_196127_522915e3 | Room Admin | 08-12-2018 09:29 | 41 | Out |
Column 'room_id/id' has only one value ('Room Admin'), so we don't need it for analysis.
df['room_id/id'].value_counts()
Room Admin    97606
Name: room_id/id, dtype: int64
df.drop('room_id/id', axis=1, inplace=True)
df.head(3)
| id | noted_date | temp | out/in | |
|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | 08-12-2018 09:30 | 29 | In |
| 1 | __export__.temp_log_196131_7bca51bc | 08-12-2018 09:30 | 29 | In |
| 2 | __export__.temp_log_196127_522915e3 | 08-12-2018 09:29 | 41 | Out |
Renaming columns for readability.
df.rename(columns={'noted_date':'date', 'out/in':'place'}, inplace=True)
df.head(3)
| id | date | temp | place | |
|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | 08-12-2018 09:30 | 29 | In |
| 1 | __export__.temp_log_196131_7bca51bc | 08-12-2018 09:30 | 29 | In |
| 2 | __export__.temp_log_196127_522915e3 | 08-12-2018 09:29 | 41 | Out |
The datetime column carries a lot of information, such as year, month, and weekday. To use this information in the EDA and modeling phases, we need to extract it from the datetime column.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M')
# vectorized .dt accessors are faster than row-wise apply
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['weekofyear'] = df['date'].dt.isocalendar().week  # dt.weekofyear is deprecated
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df.head(3)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 |
| 1 | __export__.temp_log_196131_7bca51bc | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 |
| 2 | __export__.temp_log_196127_522915e3 | 2018-12-08 09:29:00 | 41 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 29 |
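The explicit `%d-%m-%Y` format string matters here: without it, pandas defaults to month-first parsing for ambiguous dates, so `08-12-2018` would be read as August 12 rather than December 8. A minimal check:

```python
import pandas as pd

# '08-12-2018' is day-first in this dataset: 8 December 2018
s = pd.to_datetime("08-12-2018 09:30", format="%d-%m-%Y %H:%M")
print(s.month, s.day)  # 12 8

# without an explicit format, pandas parses the ambiguous date month-first
t = pd.to_datetime("08-12-2018 09:30")
print(t.month, t.day)  # 8 12
```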
Let's assume this data was collected in India.
According to this wiki page, India has four climatological seasons, listed below. We can create a season variable based on the month variable (the function that follows keeps the conventional Winter/Spring/Summer/Autumn labels rather than the Indian season names).
- Winter : December to February
- Summer : March to May
- Monsoon : June to September
- Post-monsoon : October to November
The idea came from this notebook.
Function to convert the month variable into a season.
def month2seasons(x):
    if x in [12, 1, 2]:
        season = 'Winter'
    elif x in [3, 4, 5]:
        season = 'Spring'
    elif x in [6, 7, 8]:
        season = 'Summer'
    elif x in [9, 10, 11]:
        season = 'Autumn'
    return season
df['season'] = df['month'].apply(month2seasons)
df.head(3)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter |
| 1 | __export__.temp_log_196131_7bca51bc | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter |
| 2 | __export__.temp_log_196127_522915e3 | 2018-12-08 09:29:00 | 41 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 29 | Winter |
The hour variable can be bucketed into Night, Morning, Afternoon, and Evening. We can create a timing variable based on it.
- Night : 22:00 - 23:59 / 00:00 - 03:59
- Morning : 04:00 - 11:59
- Afternoon : 12:00 - 16:59
- Evening : 17:00 - 21:59
The idea came from this notebook.
def hours2timing(x):
    if x in [22, 23, 0, 1, 2, 3]:
        timing = 'Night'
    elif x in range(4, 12):
        timing = 'Morning'
    elif x in range(12, 17):
        timing = 'Afternoon'
    elif x in range(17, 22):
        timing = 'Evening'
    else:
        timing = 'X'
    return timing
df['timing'] = df['hour'].apply(hours2timing)
df.head(3)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | __export__.temp_log_196134_bd201015 | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter | Morning |
| 1 | __export__.temp_log_196131_7bca51bc | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter | Morning |
| 2 | __export__.temp_log_196127_522915e3 | 2018-12-08 09:29:00 | 41 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 29 | Winter | Morning |
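Because the Night bucket wraps around midnight, a range-based vectorized solution is awkward; a hypothetical alternative is to build an hour-to-timing lookup once and `map` it, which avoids the row-wise `apply`:

```python
import pandas as pd

# build an hour -> timing lookup table once, then map it over the column
timing_map = {h: 'Night' for h in [22, 23, 0, 1, 2, 3]}
timing_map.update({h: 'Morning' for h in range(4, 12)})
timing_map.update({h: 'Afternoon' for h in range(12, 17)})
timing_map.update({h: 'Evening' for h in range(17, 22)})

hours = pd.Series([0, 5, 13, 18, 23])
print(hours.map(timing_map).tolist())
# ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']
```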
Column 'id' seems to carry some information related to the 'date' column.
Since 'date' has no seconds component, 'id' may encode seconds or some other marker of when each reading was collected.
The idea came from this notebook.
Checking for duplicated records turned up a duplicate, so we need to collapse duplicates into a single record.
df[df.duplicated()]
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | __export__.temp_log_196108_4a983c7e | 2018-12-08 09:25:00 | 42 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 25 | Winter | Morning |
df[df['id']=='__export__.temp_log_196108_4a983c7e']
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | __export__.temp_log_196108_4a983c7e | 2018-12-08 09:25:00 | 42 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 25 | Winter | Morning |
| 11 | __export__.temp_log_196108_4a983c7e | 2018-12-08 09:25:00 | 42 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 25 | Winter | Morning |
df.drop_duplicates(inplace=True)
df[df.duplicated()]
(empty DataFrame: no duplicate records remain)
Looking closely at the 'id' column, it appears to hold unique values made of two decomposable components: a numeric part and an alphanumeric part.
For example, the 'id' '__export__.temp_log_101144_ff2f0b97' decomposes into two parts. The alphanumeric part looks opaque, but the numeric part may indicate the uniqueness or ordering of each record, for example seconds information.
- numeric part : 101144
- alpha-numeric part : ff2f0b97
At the same datetime (2018-09-12 03:09:00), there are many records, each with a unique id.
df.loc[df['date']=='2018-09-12 03:09:00', ].sort_values(by='id').head(5)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61229 | __export__.temp_log_101144_ff2f0b97 | 2018-09-12 03:09:00 | 29 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61258 | __export__.temp_log_101502_172517d2 | 2018-09-12 03:09:00 | 29 | In | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61255 | __export__.temp_log_104868_a5e526b3 | 2018-09-12 03:09:00 | 28 | In | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61231 | __export__.temp_log_108845_062b2592 | 2018-09-12 03:09:00 | 28 | In | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61272 | __export__.temp_log_112303_fca608f4 | 2018-09-12 03:09:00 | 29 | In | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
The numeric parts of 'id' have as many unique values as the data has rows, so they uniquely identify each record.
df['id'].apply(lambda x : x.split('_')[6]).nunique() == len(df)
True
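The same numeric component can also be pulled out with a regex capture instead of positional `split` indexing, which is less brittle if the id format ever changes; a small sketch on two ids from the data:

```python
import pandas as pd

ids = pd.Series(['__export__.temp_log_196134_bd201015',
                 '__export__.temp_log_101144_ff2f0b97'])

# capture the digits between 'temp_log_' and the trailing hash
nums = ids.str.extract(r'temp_log_(\d+)_')[0].astype(int)
print(nums.tolist())  # [196134, 101144]

# agrees with the split-based extraction used in the notebook
assert nums.tolist() == [int(i.split('_')[6]) for i in ids]
```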
Adding the numeric part of 'id' as the new identifier.
df['id'] = df['id'].apply(lambda x : int(x.split('_')[6]))
df.head(3)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196134 | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter | Morning |
| 1 | 196131 | 2018-12-08 09:30:00 | 29 | In | 2018 | 12 | 8 | Saturday | 49 | 9 | 30 | Winter | Morning |
| 2 | 196127 | 2018-12-08 09:29:00 | 41 | Out | 2018 | 12 | 8 | Saturday | 49 | 9 | 29 | Winter | Morning |
Selecting one datetime (2018-09-12 03:09:00) and sorting by 'id' shows gaps in the 'id' column, which makes the mapping of 'id' to 'date' a little hard to pin down. On the other hand, selecting a certain range of 'id' values (4000-4010) and sorting by number reveals a gap in 'date': 'id' 4004 is out of order relative to its neighbors.
- 17003 - 17006 : 17004 and 17005 missing
- 17006 - 17009 : 17007 and 17008 missing
If 'id' encoded time, sorting by 'id' should keep 'date' in order; but 'id' 4004 has an earlier datetime than the previous 'id'. So 'id' does not encode seconds, though it still serves as a unique identifier for each record.
- 4002 : 2018-09-09 16:24:00
- 4004 : 2018-09-09 16:23:00
There are gaps in 'id' column.
df.loc[df['date'] == '2018-09-12 03:09:00', ].sort_values(by ='id').head(5)
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61273 | 17002 | 2018-09-12 03:09:00 | 29 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61275 | 17003 | 2018-09-12 03:09:00 | 28 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61267 | 17006 | 2018-09-12 03:09:00 | 28 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61269 | 17009 | 2018-09-12 03:09:00 | 28 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
| 61271 | 17010 | 2018-09-12 03:09:00 | 29 | Out | 2018 | 9 | 12 | Wednesday | 37 | 3 | 9 | Autumn | Night |
There is a gap in 'date' column when ordered by 'id'.
df.loc[df['id'].isin(range(4000, 4011))].sort_values(by='id')
| id | date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84141 | 4000 | 2018-09-09 16:24:00 | 29 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
| 84142 | 4002 | 2018-09-09 16:24:00 | 29 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
| 84144 | 4004 | 2018-09-09 16:23:00 | 28 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 23 | Autumn | Afternoon |
| 84128 | 4006 | 2018-09-09 16:24:00 | 28 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
| 84132 | 4007 | 2018-09-09 16:24:00 | 29 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
| 84136 | 4009 | 2018-09-09 16:24:00 | 28 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
| 84137 | 4010 | 2018-09-09 16:24:00 | 28 | Out | 2018 | 9 | 9 | Sunday | 36 | 16 | 24 | Autumn | Afternoon |
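The reasoning above can be checked mechanically: if ids encoded time, the dates should be non-decreasing when sorted by id. A toy check using the four ids from the table:

```python
import pandas as pd

# ids and dates taken from the id-range table above
toy = pd.DataFrame({
    'id': [4000, 4002, 4004, 4006],
    'date': pd.to_datetime(['2018-09-09 16:24', '2018-09-09 16:24',
                            '2018-09-09 16:23', '2018-09-09 16:24']),
})

# id 4004 breaks the ordering, so 'id' cannot encode time
print(toy.sort_values('id')['date'].is_monotonic_increasing)  # False
```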
month_rd = np.round(df['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100,decimals=1)
month_rd_bar = hv.Bars(month_rd).opts(color="green")
month_rd_curve = hv.Curve(month_rd).opts(color="red")
(month_rd_bar * month_rd_curve).opts(title="Monthly Readings Count", xlabel="Month", ylabel="Percentage", yformatter='%d%%', width=700, height=300,tools=['hover'],show_grid=True)
The temperature readings clearly form a mixture of multiple distributions.
hv.Distribution(df['temp']).opts(title="Temperature Distribution", color="green", xlabel="Temperature", ylabel="Density")\
.opts(opts.Distribution(width=700, height=300,tools=['hover'],show_grid=True))
pl_cnt = np.round(df['place'].value_counts(normalize=True) * 100)
hv.Bars(pl_cnt).opts(title="Readings Place Count", color="green", xlabel="Places", ylabel="Percentage", yformatter='%d%%')\
.opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))
season_cnt = np.round(df['season'].value_counts(normalize=True) * 100)
hv.Bars(season_cnt).opts(title="Season Count", color="green", xlabel="Season", ylabel="Percentage", yformatter='%d%%')\
.opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))
timing_cnt = np.round(df['timing'].value_counts(normalize=True) * 100)
hv.Bars(timing_cnt).opts(title="Timing Count", color="green", xlabel="Timing", ylabel="Percentage", yformatter='%d%%')\
.opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))
in_month = np.round(df[df['place']=='In']['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100, decimals=1)
out_month = np.round(df[df['place']=='Out']['date'].apply(lambda x : x.strftime("%Y-%m")).value_counts(normalize=True).sort_index() * 100, decimals=1)
in_out_month = pd.merge(in_month,out_month,right_index=True,left_index=True).rename(columns={'date_x':'In', 'date_y':'Out'})
in_out_month = pd.melt(in_out_month.reset_index(), ['index']).rename(columns={'index':'Month', 'variable':'Place'})
hv.Bars(in_out_month, ['Month', 'Place'], 'value').opts(opts.Bars(title="Monthly Readings by Place Count", width=700, height=400,tools=['hover'],show_grid=True, ylabel="Count"))
- Inside temperature is composed of a single distribution, while outside temperature is composed of multiple distributions.
- It seems that the temperature inside the room is kept constant by the air conditioner, but the outside temperature is easily affected by time-series factors such as seasons.
(hv.Distribution(df[df['place']=='In']['temp'], label='In') * hv.Distribution(df[df['place']=='Out']['temp'], label='Out'))\
.opts(title="Temperature by Place Distribution", xlabel="Temperature", ylabel="Density")\
.opts(opts.Distribution(width=700, height=300,tools=['hover'],show_grid=True))
season_agg = df.groupby('season').agg({'temp': ['min', 'max']})
season_maxmin = pd.merge(season_agg['temp']['max'],season_agg['temp']['min'],right_index=True,left_index=True)
season_maxmin = pd.melt(season_maxmin.reset_index(), ['season']).rename(columns={'season':'Season', 'variable':'Max/Min'})
hv.Bars(season_maxmin, ['Season', 'Max/Min'], 'value').opts(title="Temperature by Season Max/Min", ylabel="Temperature")\
.opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))
timing_agg = df.groupby('timing').agg({'temp': ['min', 'max']})
timing_maxmin = pd.merge(timing_agg['temp']['max'],timing_agg['temp']['min'],right_index=True,left_index=True)
timing_maxmin = pd.melt(timing_maxmin.reset_index(), ['timing']).rename(columns={'timing':'Timing', 'variable':'Max/Min'})
hv.Bars(timing_maxmin, ['Timing', 'Max/Min'], 'value').opts(title="Temperature by Timing Max/Min", ylabel="Temperature")\
.opts(opts.Bars(width=700, height=300,tools=['hover'],show_grid=True))
Time-series analysis is most straightforward with a unique time index, so we average 'temp' over each ('date', 'place') pair and drop the 'id' column.
# sort by both keys so the row order matches the (sorted) groupby output below
tsdf = df.drop_duplicates(subset=['date','place']).sort_values(['date','place']).reset_index(drop=True)
tsdf['temp'] = df.groupby(['date','place'])['temp'].mean().values
tsdf.drop('id', axis=1, inplace=True)
tsdf.head(3)
| date | temp | place | year | month | day | weekday | weekofyear | hour | minute | season | timing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-07-28 07:06:00 | 31.0 | In | 2018 | 7 | 28 | Saturday | 30 | 7 | 6 | Summer | Morning |
| 1 | 2018-07-28 07:07:00 | 31.0 | Out | 2018 | 7 | 28 | Saturday | 30 | 7 | 7 | Summer | Morning |
| 2 | 2018-07-28 07:07:00 | 32.0 | In | 2018 | 7 | 28 | Saturday | 30 | 7 | 7 | Summer | Morning |
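The positional alignment between `drop_duplicates` and the `groupby` result is fragile. A sketch (with toy, hypothetical temperature values) of a groupby-only alternative that avoids positional alignment entirely; the derived calendar columns would then be re-created from 'date':

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-12-08 09:30'] * 3 + ['2018-12-08 09:29']),
    'place': ['In', 'In', 'Out', 'Out'],
    'temp': [29, 31, 41, 43],
})

# one row per (date, place) with the mean temperature, no .values alignment needed
tsdf = (df.groupby(['date', 'place'], as_index=False)['temp']
          .mean()
          .sort_values('date')
          .reset_index(drop=True))
print(tsdf['temp'].tolist())  # [43.0, 30.0, 41.0]
```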
- The outside temperature has a larger time series change than the inside temperature.
- It is thought that the inside temperature is adjusted by air conditioner, but the outside temperature is affected by seasonal temperature fluctuations.
in_month = tsdf[tsdf['place']=='In'].groupby('month').agg({'temp':['mean']})
in_month.columns = [f"{i[0]}_{i[1]}" for i in in_month.columns]
out_month = tsdf[tsdf['place']=='Out'].groupby('month').agg({'temp':['mean']})
out_month.columns = [f"{i[0]}_{i[1]}" for i in out_month.columns]
(hv.Curve(in_month, label='In') * hv.Curve(out_month, label='Out')).opts(title="Monthly Temperature Mean", ylabel="Temperature", xlabel='Month')\
.opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))
tsdf['daily'] = tsdf['date'].apply(lambda x : pd.to_datetime(x.strftime('%Y-%m-%d')))
in_day = tsdf[tsdf['place']=='In'].groupby(['daily']).agg({'temp':['mean']})
in_day.columns = [f"{i[0]}_{i[1]}" for i in in_day.columns]
out_day = tsdf[tsdf['place']=='Out'].groupby(['daily']).agg({'temp':['mean']})
out_day.columns = [f"{i[0]}_{i[1]}" for i in out_day.columns]
(hv.Curve(in_day, label='In') * hv.Curve(out_day, label='Out')).opts(title="Daily Temperature Mean", ylabel="Temperature", xlabel='Day', shared_axes=False)\
.opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))
in_wd = tsdf[tsdf['place']=='In'].groupby('weekday').agg({'temp':['mean']})
in_wd.columns = [f"{i[0]}_{i[1]}" for i in in_wd.columns]
in_wd['week_num'] = [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(i) for i in in_wd.index]
in_wd.sort_values('week_num', inplace=True)
in_wd.drop('week_num', axis=1, inplace=True)
out_wd = tsdf[tsdf['place']=='Out'].groupby('weekday').agg({'temp':['mean']})
out_wd.columns = [f"{i[0]}_{i[1]}" for i in out_wd.columns]
out_wd['week_num'] = [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(i) for i in out_wd.index]
out_wd.sort_values('week_num', inplace=True)
out_wd.drop('week_num', axis=1, inplace=True)
(hv.Curve(in_wd, label='In') * hv.Curve(out_wd, label='Out')).opts(title="Weekday Temperature Mean", ylabel="Temperature", xlabel='Weekday')\
.opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))
in_wof = tsdf[tsdf['place']=='In'].groupby('weekofyear').agg({'temp':['mean']})
in_wof.columns = [f"{i[0]}_{i[1]}" for i in in_wof.columns]
out_wof = tsdf[tsdf['place']=='Out'].groupby('weekofyear').agg({'temp':['mean']})
out_wof.columns = [f"{i[0]}_{i[1]}" for i in out_wof.columns]
(hv.Curve(in_wof, label='In') * hv.Curve(out_wof, label='Out')).opts(title="WeekofYear Temperature Mean", ylabel="Temperature", xlabel='WeekofYear')\
.opts(opts.Curve(width=700, height=300,tools=['hover'],show_grid=True))
- Plotting the overall data shows missing data points scattered randomly through the whole period.
- Interpolating with the 'nearest' method looks better (yet far from best), but many gaps still remain in the interpolated data.
in_tsdf = tsdf[tsdf['place']=='In'].reset_index(drop=True)
in_tsdf.index = in_tsdf['date']
in_all = hv.Curve(in_tsdf['temp']).opts(title="[In] Temperature All", ylabel="Temperature", xlabel='Time', color='red')
out_tsdf = tsdf[tsdf['place']=='Out'].reset_index(drop=True)
out_tsdf.index = out_tsdf['date']
out_all = hv.Curve(out_tsdf['temp']).opts(title="[Out] Temperature All", ylabel="Temperature", xlabel='Time', color='blue')
in_tsdf_int = in_tsdf['temp'].resample('1min').interpolate(method='nearest')
in_tsdf_int_all = hv.Curve(in_tsdf_int).opts(title="[In] Temperature All Interpolated with 'nearest'", ylabel="Temperature", xlabel='Time', color='red', fontsize={'title':11})
out_tsdf_int = out_tsdf['temp'].resample('1min').interpolate(method='nearest')
out_tsdf_int_all = hv.Curve(out_tsdf_int).opts(title="[Out] Temperature All Interpolated with 'nearest'", ylabel="Temperature", xlabel='Time', color='blue', fontsize={'title':11})
(in_all + in_tsdf_int_all + out_all + out_tsdf_int_all).opts(opts.Curve(width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(2)
- To forecast future temperature, we need to convert the data to a coarser granularity.
- Using interpolated daily mean data seems a good solution.
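The resample-then-interpolate step can be illustrated on a toy daily series with a gap (linear interpolation shown here; the 'spline' method used below additionally requires scipy and an `order` parameter):

```python
import pandas as pd

# toy daily series with a two-day gap (hypothetical values)
idx = pd.to_datetime(['2018-09-01', '2018-09-02', '2018-09-05', '2018-09-06'])
s = pd.Series([30.0, 31.0, 34.0, 35.0], index=idx)

# resample to a regular daily grid, then interpolate the resulting NaN days
filled = s.resample('1D').interpolate('linear')
print(len(filled))               # 6: the two gap days are now present
print(filled.isna().any())       # False: the gaps are filled
```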
in_d_org = hv.Curve(in_day).opts(title="[In] Daily Temperature Mean", ylabel="Temperature", xlabel='Time', color='red')
out_d_org = hv.Curve(out_day).opts(title="[Out] Daily Temperature Mean", ylabel="Temperature", xlabel='Time', color='blue')
inp_df = pd.DataFrame()
in_d_inp = in_day.resample('1D').interpolate('spline', order=5)
out_d_inp = out_day.resample('1D').interpolate('spline', order=5)
inp_df['In'] = in_d_inp.temp_mean
inp_df['Out'] = out_d_inp.temp_mean
in_d_inp_g = hv.Curve(inp_df['In']).opts(title="[In] Daily Temperature Mean Interpolated with 'spline'", ylabel="Temperature", xlabel='Time', color='red', fontsize={'title':10})
out_d_inp_g = hv.Curve(inp_df['Out']).opts(title="[Out] Daily Temperature Mean Interpolated with 'spline'", ylabel="Temperature", xlabel='Time', color='blue', fontsize={'title':10})
(in_d_org + in_d_inp_g + out_d_org + out_d_inp_g).opts(opts.Curve(width=400, height=300,tools=['hover'],show_grid=True)).opts(shared_axes=False).cols(2)
Building a time-series model with Prophet to predict future temperature inside/outside the room.
I chose Prophet as the time-series modeling tool for the following reasons:
- Automatic detection of trend and seasonality
- Robustness against outliers
- Customizable seasonalities
- No need for fine parameter tuning
In addition to temperature information, I added season information, which is a time-series factor that affects temperature (especially outside).
org_df = inp_df.reset_index()
org_df['season'] = org_df['daily'].apply(lambda x : month2seasons(x.month))
org_df = pd.get_dummies(org_df, columns=['season'])
org_df.head(3)
| daily | In | Out | season_Autumn | season_Summer | season_Winter | |
|---|---|---|---|---|---|---|
| 0 | 2018-07-28 | 31.142857 | 31.691667 | 0 | 1 | 0 |
| 1 | 2018-07-29 | 32.500000 | 31.333333 | 0 | 1 | 0 |
| 2 | 2018-07-30 | 31.866909 | 31.555176 | 0 | 1 | 0 |
def run_prophet(place, prediction_periods, plot_comp=True):
    # make dataframe for training
    prophet_df = pd.DataFrame()
    prophet_df['ds'] = pd.date_range(start=org_df['daily'].iloc[0], end=org_df['daily'].iloc[-1])
    prophet_df['y'] = org_df[place]
    # add seasonal information
    prophet_df['autumn'] = org_df['season_Autumn']
    prophet_df['summer'] = org_df['season_Summer']
    prophet_df['winter'] = org_df['season_Winter']
    # train model with Prophet
    m = Prophet(changepoint_prior_scale=0.1, yearly_seasonality=2, weekly_seasonality=False)
    # include conditional seasonal periodicity in the model
    m.add_seasonality(name='season_Autumn', period=124, fourier_order=5, prior_scale=0.1, condition_name='autumn')
    m.add_seasonality(name='season_Summer', period=62, fourier_order=5, prior_scale=0.1, condition_name='summer')
    m.add_seasonality(name='season_Winter', period=93, fourier_order=5, prior_scale=0.1, condition_name='winter')
    m.fit(prophet_df)
    # make dataframe for prediction
    future = m.make_future_dataframe(periods=prediction_periods)
    # add seasonal information
    future_season = pd.get_dummies(future['ds'].apply(lambda x: month2seasons(x.month)))
    future['autumn'] = future_season['Autumn']
    future['summer'] = future_season['Summer']
    future['winter'] = future_season['Winter']
    # predict the future temperature
    prophet_result = m.predict(future)
    # plot prediction
    fig1 = m.plot(prophet_result)
    ax = fig1.gca()
    ax.set_title(f"{place} Prediction", size=25)
    ax.set_xlabel("Time", size=15)
    ax.set_ylabel("Temperature", size=15)
    a = add_changepoints_to_plot(ax, m, prophet_result)
    fig1.show()
    # plot decomposed time-series components
    if plot_comp:
        fig2 = m.plot_components(prophet_result)
        fig2.show()
run_prophet('In',30)
run_prophet('Out',30)
- The ID column carried no directly useful information, but its numeric part served as a unique identifier for each row.
- We extracted useful features for analysis, such as season and timing, from the datetime column.
- Inside temperature follows a single distribution, while outside temperature is a mixture of several distributions.
- Outside temperature is more affected by seasonal fluctuations than inside temperature.
- The many gaps in the data made modeling difficult; interpolating the daily-mean data with the 'spline' method worked well.
- Some outliers made forecasting harder, but thanks to Prophet's robustness to outliers we could still build a reasonably robust model.
- Outside temperature is composed of multiple distributions, while inside temperature has a single distribution.
- Inside temperature has a flat trend, but outside temperature appears to be affected by time-series factors such as seasonality.
dist = (hv.Distribution(df[df['place']=='In']['temp'], label='In') * hv.Distribution(df[df['place']=='Out']['temp'], label='Out'))\
.opts(title="Temperature by Place Distribution", xlabel="Temperature", ylabel="Density",tools=['hover'],show_grid=True, fontsize={'title':11})
tsdf['daily'] = tsdf['date'].apply(lambda x : pd.to_datetime(x.strftime('%Y-%m-%d')))
in_day = tsdf[tsdf['place']=='In'].groupby(['daily']).agg({'temp':['mean']})
in_day.columns = [f"{i[0]}_{i[1]}" for i in in_day.columns]
out_day = tsdf[tsdf['place']=='Out'].groupby(['daily']).agg({'temp':['mean']})
out_day.columns = [f"{i[0]}_{i[1]}" for i in out_day.columns]
curve = (hv.Curve(in_day, label='In') * hv.Curve(out_day, label='Out')).opts(title="Daily Temperature Mean", ylabel="Temperature", xlabel='Day', shared_axes=False,tools=['hover'],show_grid=True)
(dist + curve).opts(opts.Distribution(width=400, height=300), opts.Curve(width=400, height=300))
- As shown below, outside temperature has larger variance than inside temperature.
in_var = hv.Violin(org_df['In'].values, vdims='Temperature').opts(title="In Temperature Variance", box_color='red')
out_var = hv.Violin(org_df['Out'].values, vdims='Temperature').opts(title="Out Temperature Variance", box_color='blue')
(in_var + out_var).opts(opts.Violin(width=400, height=300,show_grid=True))
- We built the forecasting model by using Prophet.
- Predicting the next 30 points (about a month), the model appears to generate future points with reasonable accuracy.
run_prophet('In',30, False)
run_prophet('Out',30, False)
- Prophet by Facebook
https://facebook.github.io/prophet/docs/quick_start.html